library(tidyverse)
library(janitor)
library(readxl)
#library(tidylog)
library(skimr)
library(knitr)

This week’s tidytuesday data comes from Pro Football Reference and includes attendance, standings, and game stats for each game. Well do a quick EDA and generate a few ideas of what might be interesting to look at.

# Import the data from tidytuesday: https://github.com/rfordatascience/tidytuesday

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')

First, lets take a look at the data and see what it can tell us at face value.


The attendance dataset


team team_name year total home away week weekly_attendance
Arizona Cardinals 2000 893926 387475 506451 1 77434
Arizona Cardinals 2000 893926 387475 506451 2 66009
Arizona Cardinals 2000 893926 387475 506451 3 NA
Arizona Cardinals 2000 893926 387475 506451 4 71801
Arizona Cardinals 2000 893926 387475 506451 5 66985
Arizona Cardinals 2000 893926 387475 506451 6 44296
Data summary
Name Piped data
Number of rows 10846
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2009.53 5.75 2000 2005.0 2010 2015.00 2019 ▇▇▇▇▇
total 0 1.00 1080910.03 72876.97 760644 1040509.0 1081090 1123230.00 1322087 ▁▁▇▆▁
home 0 1.00 540455.01 66774.65 202687 504360.0 543185 578342.00 741775 ▁▁▅▇▁
away 0 1.00 540455.01 25509.33 450295 524974.0 541757 557741.00 601655 ▁▂▇▇▂
week 0 1.00 9.00 4.90 1 5.0 9 13.00 17 ▇▆▆▆▇
weekly_attendance 638 0.94 67556.88 9022.02 23127 63245.5 68334 72544.75 105121 ▁▁▇▃▁

We certainly have some interesting statistics here to mess with. It looks like the ‘total,’ ‘home,’ and ‘away’ attendance columns represent the entire season, while the ‘weekly_attendance’ field is for each week. Therefore, it looks like each row is a game. So there should be some variability in how many rows there are amongst the teams, as some make it to the playoff and some do not (therefore they’d have more games). A look at the summary statistics should tell us more… if that is the case or not.

So, as was expected, there does seem to be some variability associated with how many games they play. Not to worry, I’ll probably stay away from analyzing just that, and focus on either more aggregated descriptives or simply other statistics herein.


The standings dataset


team team_name year wins loss points_for points_against points_differential margin_of_victory strength_of_schedule simple_rating offensive_ranking defensive_ranking playoffs sb_winner
Miami Dolphins 2000 11 5 323 226 97 6.1 1.0 7.1 0.0 7.1 Playoffs No Superbowl
Indianapolis Colts 2000 10 6 429 326 103 6.4 1.5 7.9 7.1 0.8 Playoffs No Superbowl
New York Jets 2000 9 7 321 321 0 0.0 3.5 3.5 1.4 2.2 No Playoffs No Superbowl
Buffalo Bills 2000 8 8 315 350 -35 -2.2 2.2 0.0 0.5 -0.5 No Playoffs No Superbowl
New England Patriots 2000 5 11 276 338 -62 -3.9 1.4 -2.5 -2.7 0.2 No Playoffs No Superbowl
Tennessee Titans 2000 13 3 346 191 155 9.7 -1.3 8.3 1.5 6.8 Playoffs No Superbowl
Data summary
Name Piped data
Number of rows 638
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 13 0 32 0
team_name 0 1 4 10 0 32 0
playoffs 0 1 8 11 0 2 0
sb_winner 0 1 12 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2009.53 5.76 2000.0 2005.00 2010.0 2014.75 2019.0 ▇▇▇▇▇
wins 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
loss 0 1 7.98 3.08 0.0 6.00 8.0 10.00 16.0 ▂▆▇▆▂
points_for 0 1 350.28 71.40 161.0 299.00 348.0 396.00 606.0 ▂▇▇▂▁
points_against 0 1 350.28 59.55 165.0 310.00 347.0 391.50 517.0 ▁▃▇▆▁
points_differential 0 1 0.00 101.09 -261.0 -75.00 1.5 72.75 315.0 ▂▆▇▅▁
margin_of_victory 0 1 0.00 6.32 -16.3 -4.70 0.1 4.57 19.7 ▂▆▇▅▁
strength_of_schedule 0 1 0.00 1.63 -4.6 -1.10 0.0 1.20 4.3 ▁▅▇▅▁
simple_rating 0 1 0.00 6.20 -17.4 -4.47 0.0 4.50 20.1 ▁▆▇▅▁
offensive_ranking 0 1 0.00 4.34 -11.7 -3.18 0.0 2.70 15.9 ▁▇▇▂▁
defensive_ranking 0 1 0.00 3.57 -9.8 -2.40 0.1 2.50 9.8 ▁▅▇▅▁

So, look at this snapshot of the data, there seems to be all kinds of interesting statistics between attendance and what appears to be by team year statistics.


The games dataset


year week home_team away_team winner tie day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss home_team_name home_team_city away_team_name away_team_city
2000 1 Minnesota Vikings Chicago Bears Minnesota Vikings NA Sun September 3 13:00:00 30 27 374 1 425 1 Vikings Minnesota Bears Chicago
2000 1 Kansas City Chiefs Indianapolis Colts Indianapolis Colts NA Sun September 3 13:00:00 27 14 386 2 280 1 Chiefs Kansas City Colts Indianapolis
2000 1 Washington Redskins Carolina Panthers Washington Redskins NA Sun September 3 13:01:00 20 17 396 0 236 1 Redskins Washington Panthers Carolina
2000 1 Atlanta Falcons San Francisco 49ers Atlanta Falcons NA Sun September 3 13:02:00 36 28 359 1 339 1 Falcons Atlanta 49ers San Francisco
2000 1 Pittsburgh Steelers Baltimore Ravens Baltimore Ravens NA Sun September 3 13:02:00 16 0 336 0 223 1 Steelers Pittsburgh Ravens Baltimore
2000 1 Cleveland Browns Jacksonville Jaguars Jacksonville Jaguars NA Sun September 3 13:02:00 27 7 398 0 249 1 Browns Cleveland Jaguars Jacksonville

From this vantage, it looks like I could connect the team name from the attendance data to the home team of the games data. To do so, I’ll have to make a key that matches. Most readily, it looks like I’ll need to combine a few the ‘team’ and ‘team_name’ columns in the attendance data.

year total home away week weekly_attendance t_name away_team winner tie day date time pts_win pts_loss yds_win turnovers_win yds_loss turnovers_loss home_team_name home_team_city away_team_name away_team_city
2000 893926 387475 506451 1 77434 Arizona Cardinals NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2000 893926 387475 506451 2 66009 Arizona Cardinals Dallas Cowboys Arizona Cardinals NA Sun September 10 20:35:00 32 31 322 1 330 2 Cardinals Arizona Cowboys Dallas
2000 893926 387475 506451 3 NA Arizona Cardinals NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2000 893926 387475 506451 4 71801 Arizona Cardinals Green Bay Packers Green Bay Packers NA Sun September 24 16:06:00 29 3 455 1 209 4 Cardinals Arizona Packers Green Bay
2000 893926 387475 506451 5 66985 Arizona Cardinals NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2000 893926 387475 506451 6 44296 Arizona Cardinals Cleveland Browns Arizona Cardinals NA Sun October 8 16:15:00 29 21 315 2 240 0 Cardinals Arizona Browns Cleveland

At first glance it looks like there is a bunch of missing data, but because I merged on the home team, almost all the missingness appears to be related to the team is away or if they had a by week. Which makes it less alarming… I was alarmed at first… “did my join work?”

Nevertheless, I think we’re almost at a point to start visualizing stuff. It would be helpful to have a column identifying if a team is home or away… just so I can sort on it later.

So, by doing all of this, I can see trends across time of how well a team does in terms of attendance between home and away games.

Lets take a look.

Well, that’s interesting… but I wonder if the obvious dip in at home attendance between ~2007 and ~2013 is simply a localized artifact to the Chiefs.

So obviously, each team has a fairly unique pattern of attendance, especially between home and away games. You can also see where some teams got a new stadium that pushed attendance up, such as the Dallas Cowboys.

Although both the previous graph do not exhibit any clear reoccuring or dicernable trends, it makes me wonder what the relationship between home attendance and season performance is. Said another way, does a strong home team crowd during the first half of the season, relate to eventual success?

Lets take a look.

## [1] 0.09888602

So much variability. I would assume this is because each team has different stadium sizes, and therefore the y-axis is not relative across teams. Lets see if we can control this.

Let’s first create a function to normalize data within group (year, team)

Group it and then lets see how things relate.

Lets take a look now.

Although first half season attendance does not seem to be extensively predictive, perhaps there are other indicators that could help improve this model. Lets look a bit deeper here.

## [1] 0.0536111
## # A tibble: 2 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    5.01e- 1  7.66e- 3     65.4  1.72e-284
## 2 avg_first_half 8.59e+14  6.34e+14      1.35 1.76e-  1
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00287       0.00131 0.193      1.83   0.176     2   146. -286. -273.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
## # A tibble: 4 x 5
##   term                 estimate std.error statistic p.value
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)          8.43e- 1  2.70e+ 0     0.312   0.755
## 2 year                -1.70e- 4  1.34e- 3    -0.126   0.899
## 3 avg_first_half       1.15e+17  2.34e+17     0.491   0.624
## 4 year:avg_first_half -5.67e+13  1.16e+14    -0.487   0.626
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00325      -0.00147 0.193     0.689   0.559     4   146. -283. -260.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
## # A tibble: 4 x 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     8.60e+ 0  1.81e+ 0      4.75 2.55e-  6
## 2 year           -4.39e- 3  9.04e- 4     -4.85 1.53e-  6
## 3 avg_first_half  4.38e+14  4.26e+14      1.03 3.04e-  1
## 4 points_for      2.03e- 3  7.29e- 5     27.9  2.68e-112
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.552         0.550 0.129      261. 3.77e-110     4   402. -793. -771.
## # … with 2 more variables: deviance <dbl>, df.residual <int>